{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "l-23gBrt4x2B"
      },
      "source": [
        "##### Copyright 2021 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "HMUDt0CiUJk9"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "77z2OchJTk0l"
      },
      "source": [
        "# Migrate `tf.feature_column`s to Keras preprocessing layers\n",
        "\n",
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://www.tensorflow.org/guide/migrate/migrating_feature_columns\">\n",
        "    <img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />\n",
        "    View on TensorFlow.org</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/migrate/migrating_feature_columns.ipynb\">\n",
        "    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />\n",
        "    Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/docs/blob/master/site/en/guide/migrate/migrating_feature_columns.ipynb\">\n",
        "    <img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />\n",
        "    View source on GitHub</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://storage.googleapis.com/tensorflow_docs/docs/site/en/guide/migrate/migrating_feature_columns.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-5jGPDA2PDPI"
      },
      "source": [
        "Training a model usually comes with some amount of feature preprocessing, particularly when dealing with structured data. When training a `tf.estimator.Estimator` in TensorFlow 1, you usually perform feature preprocessing with the `tf.feature_column` API. In TensorFlow 2, you can do this directly with Keras preprocessing layers.\n",
        "\n",
        "This migration guide demonstrates common feature transformations using both feature columns and preprocessing layers, followed by training a complete model with both APIs.\n",
        "\n",
        "First, start with a couple of necessary imports:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iE0vSfMXumKI"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "import tensorflow.compat.v1 as tf1\n",
        "import math"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NVPYTQAWtDwH"
      },
      "source": [
        "Now, add a utility function for calling a feature column for demonstration:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LAaifuuytJjM"
      },
      "outputs": [],
      "source": [
        "def call_feature_columns(feature_columns, inputs):\n",
        "  # This is a convenient way to call a `feature_column` outside of an estimator\n",
        "  # to display its output.\n",
        "  feature_layer = tf1.keras.layers.DenseFeatures(feature_columns)\n",
        "  return feature_layer(inputs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZJnw07hYDGYt"
      },
      "source": [
        "## Input handling\n",
        "\n",
        "To use feature columns with an estimator, model inputs are always expected to be a dictionary of tensors:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "y0WUpQxsKEzf"
      },
      "outputs": [],
      "source": [
        "input_dict = {\n",
        "  'foo': tf.constant([1]),\n",
        "  'bar': tf.constant([0]),\n",
        "  'baz': tf.constant([-1])\n",
        "}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xYsC6H_BJ8l3"
      },
      "source": [
        "Each feature column needs to be created with a key to index into the source data. The output of all feature columns is concatenated and used by the estimator model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3fvIe3V8Ffjt"
      },
      "outputs": [],
      "source": [
        "columns = [\n",
        "  tf1.feature_column.numeric_column('foo'),\n",
        "  tf1.feature_column.numeric_column('bar'),\n",
        "  tf1.feature_column.numeric_column('baz'),\n",
        "]\n",
        "call_feature_columns(columns, input_dict)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hvPfCK2XGTyl"
      },
      "source": [
        "In Keras, model input is much more flexible. A `tf.keras.Model` can handle a single tensor input, a list of tensor features, or a dictionary of tensor features. You can handle dictionary input by passing a dictionary of `tf.keras.Input` on model creation. Inputs will not be concatenated automatically, which allows them to be used in much more flexible ways. They can be concatenated with `tf.keras.layers.Concatenate`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5sYWENkgLWJ2"
      },
      "outputs": [],
      "source": [
        "inputs = {\n",
        "  'foo': tf.keras.Input(shape=()),\n",
        "  'bar': tf.keras.Input(shape=()),\n",
        "  'baz': tf.keras.Input(shape=()),\n",
        "}\n",
        "# Inputs are typically transformed by preprocessing layers before concatenation.\n",
        "outputs = tf.keras.layers.Concatenate()(inputs.values())\n",
        "model = tf.keras.Model(inputs=inputs, outputs=outputs)\n",
        "model(input_dict)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GXkmiuwXTS-B"
      },
      "source": [
        "## One-hot encoding integer IDs\n",
        "\n",
        "A common feature transformation is one-hot encoding integer inputs of a known range. Here is an example using feature columns:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XasXzOgatgRF"
      },
      "outputs": [],
      "source": [
        "categorical_col = tf1.feature_column.categorical_column_with_identity(\n",
        "    'type', num_buckets=3)\n",
        "indicator_col = tf1.feature_column.indicator_column(categorical_col)\n",
        "call_feature_columns(indicator_col, {'type': [0, 1, 2]})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iSCkJEQ6U-ru"
      },
      "source": [
        "Using Keras preprocessing layers, these columns can be replaced by a single `tf.keras.layers.CategoryEncoding` layer with `output_mode` set to `'one_hot'`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "799lbMNNuAVz"
      },
      "outputs": [],
      "source": [
        "one_hot_layer = tf.keras.layers.CategoryEncoding(\n",
        "    num_tokens=3, output_mode='one_hot')\n",
        "one_hot_layer([0, 1, 2])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kNzRtESU7tga"
      },
      "source": [
        "Note: For large one-hot encodings, it is much more efficient to use a sparse representation of the output. If you pass `sparse=True` to the `CategoryEncoding` layer, the output of the layer will be a `tf.sparse.SparseTensor`, which can be efficiently handled as input to a `tf.keras.layers.Dense` layer."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Zf7kjhTiAErK"
      },
      "source": [
        "## Normalizing numeric features\n",
        "\n",
        "When handling continuous, floating-point features with feature columns, you need to use a `tf.feature_column.numeric_column`. In the case where the input is already normalized, converting this to Keras is trivial. You can simply use a `tf.keras.Input` directly into your model, as shown above.\n",
        "\n",
        "A `numeric_column` can also be used to normalize input:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HbTMGB9XctGx"
      },
      "outputs": [],
      "source": [
        "def normalize(x):\n",
        "  mean, variance = (2.0, 1.0)\n",
        "  return (x - mean) / math.sqrt(variance)\n",
        "numeric_col = tf1.feature_column.numeric_column('col', normalizer_fn=normalize)\n",
        "call_feature_columns(numeric_col, {'col': tf.constant([[0.], [1.], [2.]])})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M9cyhPR_drOz"
      },
      "source": [
        "In contrast, with Keras, this normalization can be done with `tf.keras.layers.Normalization`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8bcgG-yOdqUH"
      },
      "outputs": [],
      "source": [
        "normalization_layer = tf.keras.layers.Normalization(mean=2.0, variance=1.0)\n",
        "normalization_layer(tf.constant([[0.], [1.], [2.]]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d1InD_4QLKU-"
      },
      "source": [
        "## Bucketizing and one-hot encoding numeric features"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "k5e0b8iOLRzd"
      },
      "source": [
        "Another common transformation of continuous, floating point inputs is to bucketize then to integers of a fixed range.\n",
        "\n",
        "In feature columns, this can be achieved with a `tf.feature_column.bucketized_column`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_rbx6qQ-LQx7"
      },
      "outputs": [],
      "source": [
        "numeric_col = tf1.feature_column.numeric_column('col')\n",
        "bucketized_col = tf1.feature_column.bucketized_column(numeric_col, [1, 4, 5])\n",
        "call_feature_columns(bucketized_col, {'col': tf.constant([1., 2., 3., 4., 5.])})\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PCYu-XtwXahx"
      },
      "source": [
        "In Keras, this can be replaced by `tf.keras.layers.Discretization`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QK1WOG2uVVsL"
      },
      "outputs": [],
      "source": [
        "discretization_layer = tf.keras.layers.Discretization(bin_boundaries=[1, 4, 5])\n",
        "one_hot_layer = tf.keras.layers.CategoryEncoding(\n",
        "    num_tokens=4, output_mode='one_hot')\n",
        "one_hot_layer(discretization_layer([1., 2., 3., 4., 5.]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5bm9tJZAgpt4"
      },
      "source": [
        "## One-hot encoding string data with a vocabulary\n",
        "\n",
        "Handling string features often requires a vocabulary lookup to translate strings into indices. Here is an example using feature columns to lookup strings and then one-hot encode the indices:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3fG_igjhukCO"
      },
      "outputs": [],
      "source": [
        "vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(\n",
        "    'sizes',\n",
        "    vocabulary_list=['small', 'medium', 'large'],\n",
        "    num_oov_buckets=0)\n",
        "indicator_col = tf1.feature_column.indicator_column(vocab_col)\n",
        "call_feature_columns(indicator_col, {'sizes': ['small', 'medium', 'large']})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8rBgllRtY738"
      },
      "source": [
        "Using Keras preprocessing layers, use the `tf.keras.layers.StringLookup` layer with `output_mode` set to `'one_hot'`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "arnPlSrWvDMe"
      },
      "outputs": [],
      "source": [
        "string_lookup_layer = tf.keras.layers.StringLookup(\n",
        "    vocabulary=['small', 'medium', 'large'],\n",
        "    num_oov_indices=0,\n",
        "    output_mode='one_hot')\n",
        "string_lookup_layer(['small', 'medium', 'large'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f76MVVYO8LB5"
      },
      "source": [
        "Note: For large one-hot encodings, it is much more efficient to use a sparse representation of the output. If you pass `sparse=True` to the `StringLookup` layer, the output of the layer will be a `tf.sparse.SparseTensor`, which can be efficiently handled as input to a `tf.keras.layers.Dense` layer."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "c1CmfSXQZHE5"
      },
      "source": [
        "## Embedding string data with a vocabulary\n",
        "\n",
        "For larger vocabularies, an embedding is often needed for good performance. Here is an example embedding a string feature using feature columns:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "C3RK4HFazxlU"
      },
      "outputs": [],
      "source": [
        "vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(\n",
        "    'col',\n",
        "    vocabulary_list=['small', 'medium', 'large'],\n",
        "    num_oov_buckets=0)\n",
        "embedding_col = tf1.feature_column.embedding_column(vocab_col, 4)\n",
        "call_feature_columns(embedding_col, {'col': ['small', 'medium', 'large']})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3aTRVJ6qZZH0"
      },
      "source": [
        "Using Keras preprocessing layers, this can be achieved by combining a `tf.keras.layers.StringLookup` layer and an `tf.keras.layers.Embedding` layer. The default output for the `StringLookup` will be integer indices which can be fed directly into an embedding.\n",
        "\n",
        "Note: The `Embedding` layer contains trainable parameters. While the `StringLookup` layer can be applied to data inside or outside of a model, the `Embedding` must always be part of a trainable Keras model to train correctly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8resGZPo0Fho"
      },
      "outputs": [],
      "source": [
        "string_lookup_layer = tf.keras.layers.StringLookup(\n",
        "    vocabulary=['small', 'medium', 'large'], num_oov_indices=0)\n",
        "embedding = tf.keras.layers.Embedding(3, 4)\n",
        "embedding(string_lookup_layer(['small', 'medium', 'large']))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UwqvADV6HRdC"
      },
      "source": [
        "## Summing weighted categorical data\n",
        "\n",
        "In some cases, you need to deal with categorical data where each occurance of a category comes with an associated weight. In feature columns, this is handled with `tf.feature_column.weighted_categorical_column`. When paired with an `indicator_column`, this has the effect of summing weights per category."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "02HqjPLMRxWn"
      },
      "outputs": [],
      "source": [
        "ids = tf.constant([[5, 11, 5, 17, 17]])\n",
        "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n",
        "\n",
        "categorical_col = tf1.feature_column.categorical_column_with_identity(\n",
        "    'ids', num_buckets=20)\n",
        "weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n",
        "    categorical_col, 'weights')\n",
        "indicator_col = tf1.feature_column.indicator_column(weighted_categorical_col)\n",
        "call_feature_columns(indicator_col, {'ids': ids, 'weights': weights})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "98jaq7Q3S9aG"
      },
      "source": [
        "In Keras, this can be done by passing a `count_weights` input to `tf.keras.layers.CategoryEncoding` with `output_mode='count'`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JsoYUUgRS7hu"
      },
      "outputs": [],
      "source": [
        "ids = tf.constant([[5, 11, 5, 17, 17]])\n",
        "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n",
        "\n",
        "# Using sparse output is more efficient when `num_tokens` is large.\n",
        "count_layer = tf.keras.layers.CategoryEncoding(\n",
        "    num_tokens=20, output_mode='count', sparse=True)\n",
        "tf.sparse.to_dense(count_layer(ids, count_weights=weights))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gBJxb6y2GasI"
      },
      "source": [
        "## Embedding weighted categorical data\n",
        "\n",
        "You might alternately want to embed weighted categorical inputs. In feature columns, the `embedding_column` contains a `combiner` argument. If any sample\n",
        "contains multiple entries for a category, they will be combined according to the argument setting (by default `'mean'`)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AjOt1wgmT5mM"
      },
      "outputs": [],
      "source": [
        "ids = tf.constant([[5, 11, 5, 17, 17]])\n",
        "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n",
        "\n",
        "categorical_col = tf1.feature_column.categorical_column_with_identity(\n",
        "    'ids', num_buckets=20)\n",
        "weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n",
        "    categorical_col, 'weights')\n",
        "embedding_col = tf1.feature_column.embedding_column(\n",
        "    weighted_categorical_col, 4, combiner='mean')\n",
        "call_feature_columns(embedding_col, {'ids': ids, 'weights': weights})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fd6eluARXndC"
      },
      "source": [
        "In Keras, there is no `combiner` option to `tf.keras.layers.Embedding`, but you can achieve the same effect with `tf.keras.layers.Dense`. The `embedding_column` above is simply linearly combining embedding vectors according to category weight. Though not obvious at first, it is exactly equivalent to representing your categorical inputs as a sparse weight vector of size `(num_tokens)`, and multiplying them by a `Dense` kernel of shape `(embedding_size, num_tokens)`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Y-vZvPyiYilE"
      },
      "outputs": [],
      "source": [
        "ids = tf.constant([[5, 11, 5, 17, 17]])\n",
        "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n",
        "\n",
        "# For `combiner='mean'`, normalize your weights to sum to 1. Removing this line\n",
        "# would be equivalent to an `embedding_column` with `combiner='sum'`.\n",
        "weights = weights / tf.reduce_sum(weights, axis=-1, keepdims=True)\n",
        "\n",
        "count_layer = tf.keras.layers.CategoryEncoding(\n",
        "    num_tokens=20, output_mode='count', sparse=True)\n",
        "embedding_layer = tf.keras.layers.Dense(4, use_bias=False)\n",
        "embedding_layer(count_layer(ids, count_weights=weights))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3I5loEx80MVm"
      },
      "source": [
        "## Complete training example\n",
        "\n",
        "To show a complete training workflow, first prepare some data with three features of different types:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "D_7nyBee0ZBV"
      },
      "outputs": [],
      "source": [
        "features = {\n",
        "    'type': [0, 1, 1],\n",
        "    'size': ['small', 'small', 'medium'],\n",
        "    'weight': [2.7, 1.8, 1.6],\n",
        "}\n",
        "labels = [1, 1, 0]\n",
        "predict_features = {'type': [0], 'size': ['foo'], 'weight': [-0.7]}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e_4Xx2c37lqD"
      },
      "source": [
        "Define some common constants for both TensorFlow 1 and TensorFlow 2 workflows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3cyfQZ7z8jZh"
      },
      "outputs": [],
      "source": [
        "vocab = ['small', 'medium', 'large']\n",
        "one_hot_dims = 3\n",
        "embedding_dims = 4\n",
        "weight_mean = 2.0\n",
        "weight_variance = 1.0"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ywCgU7CMIfTH"
      },
      "source": [
        "### With feature columns\n",
        "\n",
        "Feature columns must be passed as a list to the estimator on creation, and will be called implicitly during training."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Wsdhlm-uipr1"
      },
      "outputs": [],
      "source": [
        "categorical_col = tf1.feature_column.categorical_column_with_identity(\n",
        "    'type', num_buckets=one_hot_dims)\n",
        "# Convert index to one-hot; e.g. [2] -> [0,0,1].\n",
        "indicator_col = tf1.feature_column.indicator_column(categorical_col)\n",
        "\n",
        "# Convert strings to indices; e.g. ['small'] -> [1].\n",
        "vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(\n",
        "    'size', vocabulary_list=vocab, num_oov_buckets=1)\n",
        "# Embed the indices.\n",
        "embedding_col = tf1.feature_column.embedding_column(vocab_col, embedding_dims)\n",
        "\n",
        "normalizer_fn = lambda x: (x - weight_mean) / math.sqrt(weight_variance)\n",
        "# Normalize the numeric inputs; e.g. [2.0] -> [0.0].\n",
        "numeric_col = tf1.feature_column.numeric_column(\n",
        "    'weight', normalizer_fn=normalizer_fn)\n",
        "\n",
        "estimator = tf1.estimator.DNNClassifier(\n",
        "    feature_columns=[indicator_col, embedding_col, numeric_col],\n",
        "    hidden_units=[1])\n",
        "\n",
        "def _input_fn():\n",
        "  return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)\n",
        "\n",
        "estimator.train(_input_fn)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qPIeG_YtfNV1"
      },
      "source": [
        "The feature columns will also be used to transform input data when running inference on the model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "K-AIIB8CfSqt"
      },
      "outputs": [],
      "source": [
        "def _predict_fn():\n",
        "  return tf1.data.Dataset.from_tensor_slices(predict_features).batch(1)\n",
        "\n",
        "next(estimator.predict(_predict_fn))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "baMA01cBIivo"
      },
      "source": [
        "### With Keras preprocessing layers\n",
        "\n",
        "Keras preprocessing layers are more flexible in where they can be called. A layer can be applied directly to tensors, used inside a `tf.data` input pipeline, or built directly into a trainable Keras model.\n",
        "\n",
        "In this example, you will apply preprocessing layers inside a `tf.data` input pipeline. To do this, you can define a separate `tf.keras.Model` to preprocess your input features. This model is not trainable, but is a convenient way to group preprocessing layers."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NMz8RfMQdCZf"
      },
      "outputs": [],
      "source": [
        "inputs = {\n",
        "  'type': tf.keras.Input(shape=(), dtype='int64'),\n",
        "  'size': tf.keras.Input(shape=(), dtype='string'),\n",
        "  'weight': tf.keras.Input(shape=(), dtype='float32'),\n",
        "}\n",
        "# Convert index to one-hot; e.g. [2] -> [0,0,1].\n",
        "type_output = tf.keras.layers.CategoryEncoding(\n",
        "      one_hot_dims, output_mode='one_hot')(inputs['type'])\n",
        "# Convert size strings to indices; e.g. ['small'] -> [1].\n",
        "size_output = tf.keras.layers.StringLookup(vocabulary=vocab)(inputs['size'])\n",
        "# Normalize the numeric inputs; e.g. [2.0] -> [0.0].\n",
        "weight_output = tf.keras.layers.Normalization(\n",
        "      axis=None, mean=weight_mean, variance=weight_variance)(inputs['weight'])\n",
        "outputs = {\n",
        "  'type': type_output,\n",
        "  'size': size_output,\n",
        "  'weight': weight_output,\n",
        "}\n",
        "preprocessing_model = tf.keras.Model(inputs, outputs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NRfISnj3NGlW"
      },
      "source": [
        "Note: As an alternative to supplying a vocabulary and normalization statistics on layer creation, many preprocessing layers provide an `adapt()` method for learning layer state directly from the input data. See the [preprocessing guide](https://www.tensorflow.org/guide/keras/preprocessing_layers#the_adapt_method) for more details.\n",
        "\n",
        "You can now apply this model inside a call to `tf.data.Dataset.map`. Please note that the function passed to `map` will automatically be converted into\n",
        "a `tf.function`, and usual caveats for writing `tf.function` code apply (no side effects)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c_6xAUnbNREh"
      },
      "outputs": [],
      "source": [
        "# Apply the preprocessing in tf.data.Dataset.map.\n",
        "dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1)\n",
        "dataset = dataset.map(lambda x, y: (preprocessing_model(x), y),\n",
        "                      num_parallel_calls=tf.data.AUTOTUNE)\n",
        "# Display a preprocessed input sample.\n",
        "next(dataset.take(1).as_numpy_iterator())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8_4u3J4NdJ8R"
      },
      "source": [
        "Next, you can define a separate `Model` containing the trainable layers. Note how the inputs to this model now reflect the preprocessed feature types and shapes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kC9OZO5ldmP-"
      },
      "outputs": [],
      "source": [
        "inputs = {\n",
        "  'type': tf.keras.Input(shape=(one_hot_dims,), dtype='float32'),\n",
        "  'size': tf.keras.Input(shape=(), dtype='int64'),\n",
        "  'weight': tf.keras.Input(shape=(), dtype='float32'),\n",
        "}\n",
        "# Since the embedding is trainable, it needs to be part of the training model.\n",
        "embedding = tf.keras.layers.Embedding(len(vocab), embedding_dims)\n",
        "outputs = tf.keras.layers.Concatenate()([\n",
        "  inputs['type'],\n",
        "  embedding(inputs['size']),\n",
        "  tf.expand_dims(inputs['weight'], -1),\n",
        "])\n",
        "outputs = tf.keras.layers.Dense(1)(outputs)\n",
        "training_model = tf.keras.Model(inputs, outputs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ir-cn2H_d5R7"
      },
      "source": [
        "You can now train the `training_model` with `tf.keras.Model.fit`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6TS3YJ2vnvlW"
      },
      "outputs": [],
      "source": [
        "# Train on the preprocessed data.\n",
        "training_model.compile(\n",
        "    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))\n",
        "training_model.fit(dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pSaEbOE4ecsy"
      },
      "source": [
        "Finally, at inference time, it can be useful to combine these separate stages into a single model that handles raw feature inputs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QHjbIZYneboO"
      },
      "outputs": [],
      "source": [
        "inputs = preprocessing_model.input\n",
        "outputs = training_model(preprocessing_model(inputs))\n",
        "inference_model = tf.keras.Model(inputs, outputs)\n",
        "\n",
        "predict_dataset = tf.data.Dataset.from_tensor_slices(predict_features).batch(1)\n",
        "inference_model.predict(predict_dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "O01VQIxCWBxU"
      },
      "source": [
        "This composed model can be saved as a [SavedModel](https://www.tensorflow.org/guide/saved_model) for later use."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6tsyVZgh7Pve"
      },
      "outputs": [],
      "source": [
        "inference_model.save('model')\n",
        "restored_model = tf.keras.models.load_model('model')\n",
        "restored_model.predict(predict_dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IXMBwzggwUjI"
      },
      "source": [
        "Note: Preprocessing layers are not trainable, which allows you to apply them *asynchronously* using `tf.data`. This has performance benefits, as you can both prefetch preprocessed batches, and free up any accelerators to focus on the differentiable parts of a model (learn more in the _Prefetching_ section of the [Better performance with the `tf.data` API](../data_performance.ipynb) guide). As this guide shows, separating preprocessing during training and composing it during inference is a flexible way to leverage these performance gains. However, if your model is small or preprocessing time is negligible, it may be simpler to build preprocessing into a complete model from the start. To do this you can build a single model starting with `tf.keras.Input`, followed by preprocessing layers, followed by trainable layers."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2pjp7Z18gRCQ"
      },
      "source": [
        "## Feature column equivalence table\n",
        "\n",
        "For reference, here is an approximate correspondence between feature columns and\n",
        "Keras preprocessing layers:<table>\n",
        "  <tr>\n",
        "    <th>Feature column</th>\n",
        "    <th>Keras layer</th>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.bucketized_column`</td>\n",
        "    <td>`tf.keras.layers.Discretization`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.categorical_column_with_hash_bucket`</td>\n",
        "    <td>`tf.keras.layers.Hashing`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.categorical_column_with_identity`</td>\n",
        "    <td>`tf.keras.layers.CategoryEncoding`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.categorical_column_with_vocabulary_file`</td>\n",
        "    <td>`tf.keras.layers.StringLookup` or `tf.keras.layers.IntegerLookup`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.categorical_column_with_vocabulary_list`</td>\n",
        "    <td>`tf.keras.layers.StringLookup` or `tf.keras.layers.IntegerLookup`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.crossed_column`</td>\n",
        "    <td>`tf.keras.layers.experimental.preprocessing.HashedCrossing`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.embedding_column`</td>\n",
        "    <td>`tf.keras.layers.Embedding`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.indicator_column`</td>\n",
        "    <td>`output_mode='one_hot'` or `output_mode='multi_hot'`*</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.numeric_column`</td>\n",
        "    <td>`tf.keras.layers.Normalization`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.sequence_categorical_column_with_hash_bucket`</td>\n",
        "    <td>`tf.keras.layers.Hashing`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.sequence_categorical_column_with_identity`</td>\n",
        "    <td>`tf.keras.layers.CategoryEncoding`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.sequence_categorical_column_with_vocabulary_file`</td>\n",
        "    <td>`tf.keras.layers.StringLookup`, `tf.keras.layers.IntegerLookup`, or `tf.keras.layer.TextVectorization`†</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.sequence_categorical_column_with_vocabulary_list`</td>\n",
        "    <td>`tf.keras.layers.StringLookup`, `tf.keras.layers.IntegerLookup`, or `tf.keras.layer.TextVectorization`†</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.sequence_numeric_column`</td>\n",
        "    <td>`tf.keras.layers.Normalization`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`tf.feature_column.weighted_categorical_column`</td>\n",
        "    <td>`tf.keras.layers.CategoryEncoding`</td>\n",
        "  </tr>\n",
        "</table>\n",
        "\n",
        "\\* The `output_mode` can be passed to `tf.keras.layers.CategoryEncoding`, `tf.keras.layers.StringLookup`, `tf.keras.layers.IntegerLookup`, and `tf.keras.layers.TextVectorization`.\n",
        "\n",
        "† `tf.keras.layers.TextVectorization` can handle freeform text input directly (for example, entire sentences or paragraphs). This is not one-to-one replacement for categorical sequence handling in TensorFlow 1, but may offer a convenient replacement for ad-hoc text preprocessing.\n",
        "\n",
        "Note: Linear estimators, such as `tf.estimator.LinearClassifier`, can handle direct categorical input (integer indices) without an `embedding_column` or `indicator_column`. However, integer indices cannot be passed directly to `tf.keras.layers.Dense` or `tf.keras.experimental.LinearModel`. These inputs should be first encoded with `tf.layers.CategoryEncoding` with `output_mode='count'` (and `sparse=True` if the category sizes are large) before calling into `Dense` or `LinearModel`."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AQCJ6lM3YDq_"
      },
      "source": [
        "## Next steps\n",
        "\n",
        " - For more information on Keras preprocessing layers, go to the [Working with preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) guide.\n",
        " - For a more in-depth example of applying preprocessing layers to structured data, refer to the [Classify structured data using Keras preprocessing layers](../../tutorials/structured_data/preprocessing_layers.ipynb) tutorial."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "migrating_feature_columns.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
